WilcoxCV: an R package for fast variable selection in cross-validation

Author

  • Anne-Laure Boulesteix

Abstract

In the last few years, numerous methods have been proposed for microarray-based class prediction. Although many of them have been designed especially for the case n << p (many more variables than observations), preliminary variable selection is almost always necessary when the number of genes reaches several tens of thousands, as is usual in recent data sets. In the two-class setting, the Wilcoxon rank sum test statistic is, along with the t-statistic, one of the standard approaches for variable selection. It is well known that the variable selection step must be seen as part of classifier construction and, as such, must be performed on the training data only. When classifier accuracy is evaluated via cross-validation or Monte Carlo cross-validation, this means that p Wilcoxon tests or t-tests have to be performed in each iteration, which becomes a daunting task as p increases. As a consequence, many authors perform variable selection only once using all the available data, which can induce a dramatic underestimation of the error rate and thus lead to misleading reports of predictive power. We propose a very fast implementation of variable selection based on the Wilcoxon test for use in cross-validation and Monte Carlo cross-validation (also known as random splitting into learning and test sets). This implementation is based on a simple mathematical formula using only the ranks calculated from the original data set.

Availability: Our method is implemented in the freely available R package WilcoxCV, which can be downloaded from the Comprehensive R Archive Network at http://cran.r-project.org/src/contrib/Descriptions/WilcoxCV.html.
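The idea of reusing ranks from the original data set can be illustrated with a minimal sketch. It is not the WilcoxCV implementation and not necessarily the formula used in the paper; it only relies on the fact that, without ties, an observation's rank within the training set equals its rank in the full data set minus the number of test observations with a smaller rank. The function names (`ranks`, `wilcoxon_direct`, `wilcoxon_from_full_ranks`) and the toy data are hypothetical, and the sketch is written in Python rather than R.

```python
def ranks(values):
    """Return 1-based ranks of the values (assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for position, index in enumerate(order, start=1):
        result[index] = position
    return result


def wilcoxon_direct(values, labels, train_idx):
    """Rank sum of class 1, re-ranking the training observations from scratch."""
    train_ranks = ranks([values[i] for i in train_idx])
    return sum(r for r, i in zip(train_ranks, train_idx) if labels[i] == 1)


def wilcoxon_from_full_ranks(full_ranks, labels, test_idx):
    """Rank sum of class 1 in the training set, using only full-data ranks.

    Each training observation loses one rank position for every held-out
    test observation that ranks below it in the full data set.
    """
    test = set(test_idx)
    test_ranks = [full_ranks[j] for j in test]
    total = 0
    for i, r in enumerate(full_ranks):
        if i in test or labels[i] != 1:
            continue
        total += r - sum(1 for t in test_ranks if t < r)
    return total


# Toy data for one variable (hypothetical values, no ties).
values = [2.1, 0.5, 3.3, 1.7, 4.0, 0.9, 2.8, 3.9]
labels = [1, 0, 1, 0, 1, 0, 1, 0]
full_ranks = ranks(values)  # computed once, before any splitting

test_idx = [1, 4]
train_idx = [i for i in range(len(values)) if i not in test_idx]

print(wilcoxon_direct(values, labels, train_idx))
print(wilcoxon_from_full_ranks(full_ranks, labels, test_idx))
```

Both calls return the same statistic for any split, but the second never sorts the training data again: the ranks are computed once per variable and only adjusted for the held-out observations in each iteration. Handling ties would require midranks and is left out of this sketch.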


Similar resources

Model Checking and Variable Selection in Nonparametric Regression

February 19, 2015 Type Package Title Model Checking and Variable Selection in Nonparametric Regression Version 1.0 Date 2012-08-03 Author Adriano Zanin Zambom Maintainer Adriano Zanin Zambom Depends R (>= 2.15.0), dr, MASS, graphics Description This package provides tests of significance for covariates (or groups of covariates) in a fully nonparametric regression mode...


CVThresh: R Package for Level-Dependent Cross-Validation Thresholding

The core of the wavelet approach to nonparametric regression is thresholding of wavelet coefficients. This paper reviews a cross-validation method for the selection of the thresholding value in wavelet shrinkage of Oh, Kim, and Lee (2006), and introduces the R package CVThresh implementing details of the calculations for the procedures. This procedure is implemented by coupling a conventional c...


abc: an R package for Approximate Bayesian Computation (ABC)

Background: Many recent statistical applications involve inference under complex models, where it is computationally prohibitive to calculate likelihoods but possible to simulate data. Approximate Bayesian Computation (ABC) is devoted to these complex models because it bypasses evaluations of the likelihood function using comparisons between observed and simulated summary statistics. Results: W...


Fast SFFS-Based Algorithm for Feature Selection in Biomedical Datasets

Biomedical datasets usually include a large number of features relative to the number of samples. However, some data dimensions may be less relevant or even irrelevant to the output class. Selection of an optimal subset of features is critical, not only to reduce the processing cost but also to improve the classification results. To this end, this paper presents a hybrid method of filter and wr...


The R Package groc for Generalized Regression on Orthogonal Components

The R package groc for generalized regression on orthogonal components contains functions for the prediction of q responses using a set of p predictors. The primary building block is the grid algorithm used to search for components (projections of the data) which are most dependent on the response. The package offers flexibility in the choice of the dependence measure which can be user-defined....



Journal:
  • Bioinformatics

Volume 23, Issue 13

Pages: -

Publication date: 2007